HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746

In this R notebook, we process the raw textual data for our data analysis.

Step 0 - Load all the required libraries

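We use four packages: “tm” for text mining, “tidytext” for text mining with tidy tools, “tidyverse” for data manipulation, and “DT” for rendering interactive tables.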

library(tm)
library(tidytext)
library(tidyverse)
library(DT)

Step 1 - Load the data to be cleaned and processed

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)

Step 2 - Preliminary cleaning of text

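We clean the text by converting all letters to lower case and removing punctuation, numbers and extra white space. The removeWords call is given an empty word list (character(0)), so it is only a placeholder at this stage; stopwords are removed in Step 5.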

corpus <- VCorpus(VectorSource(hm_data$cleaned_hm))%>%
  tm_map(content_transformer(tolower))%>%
  tm_map(removePunctuation)%>%
  tm_map(removeNumbers)%>%
  tm_map(removeWords, character(0))%>%
  tm_map(stripWhitespace)
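
To make the effect of each transformation concrete, here is a minimal sketch that applies the same “tm” calls to a single made-up sentence. The sentence is hypothetical and not taken from HappyDB, and the placeholder removeWords step is left out.

```r
library(tm)

# Hypothetical input, not a real HappyDB moment
toy <- "I went  to the GYM 3 times, and it felt GREAT!"

toy_corpus <- VCorpus(VectorSource(toy))
toy_corpus <- tm_map(toy_corpus, content_transformer(tolower))  # lower case
toy_corpus <- tm_map(toy_corpus, removePunctuation)             # drop "," and "!"
toy_corpus <- tm_map(toy_corpus, removeNumbers)                 # drop "3"
toy_corpus <- tm_map(toy_corpus, stripWhitespace)               # collapse repeated spaces

# Pull the cleaned text back out of the single document
cleaned <- as.character(toy_corpus[[1]])
print(cleaned)
```

Note that removing the number leaves two adjacent spaces behind, which is one reason stripWhitespace runs last.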

Step 3 - Stemming words and converting tm object to tidy object

Stemming reduces a word to its word stem. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.

stemmed <- tm_map(corpus, stemDocument) %>%
  tidy() %>%
  select(text)
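
As a small illustration of what stemming does to individual words, the sketch below calls the Snowball stemmer directly. It assumes the “SnowballC” package is installed, which the notebook does not load explicitly, and the word list is made up.

```r
library(SnowballC)

# Hypothetical words chosen to show several inflections of the same root
words <- c("running", "runs", "happiness", "families")

# Stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")

print(data.frame(word = words, stem = stems))
```

Stems such as the one produced for "happiness" are not dictionary words, which is why the later steps map each stem back to a readable word.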

Step 4 - Creating tidy format of the dictionary to be used for completing stems

We also need a dictionary to look up the words corresponding to the stems.

dict <- tidy(corpus) %>%
  select(text) %>%
  unnest_tokens(dictionary, text)
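
The following sketch shows, on two made-up moments, how unnest_tokens turns one row per moment into one row per word. The tibble contents are hypothetical.

```r
library(tibble)
library(tidytext)

# Hypothetical cleaned moments, one per row
toy <- tibble(text = c("i walked my dog", "we cooked dinner"))

# One row per word; the output column is named "dictionary" as in Step 4
tokens <- unnest_tokens(toy, dictionary, text)
print(tokens)
```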

Step 5 - Removing stopwords that don’t hold any significant information for our data set

We remove the stopwords provided by the “tidytext” package and add custom stopwords specific to our data.

data("stop_words")

word <- c("happy", "ago", "yesterday", "lot", "today", "months", "month",
          "happier", "happiest", "last", "week", "past")

stop_words <- stop_words %>%
  bind_rows(mutate(tibble(word), lexicon = "updated"))
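
A minimal sketch of how an extended stopword list filters tokens is given below. The names custom, all_stop_words and tokens are introduced here for illustration only, and the token list is made up.

```r
library(dplyr)
library(tibble)
library(tidytext)

data("stop_words")

# Hypothetical custom additions, tagged with their own lexicon label
custom <- tibble(word = c("happy", "today"), lexicon = "updated")
all_stop_words <- bind_rows(stop_words, custom)

# Hypothetical tokens from one moment
tokens <- tibble(word = c("happy", "dog", "the", "pizza", "today"))

# Keep only the tokens that do not appear in the stopword list
kept <- anti_join(tokens, all_stop_words, by = "word")
print(kept$word)
```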

Step 6 - Combining stems and dictionary into the same tibble

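Here we combine the stems and the dictionary into the same “tidy” object. Both are tokenized from the same moments in the same order, so the stem and the original word line up row by row. We then drop every row whose original word is a stopword.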

completed <- stemmed %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text) %>%
  bind_cols(dict) %>%
  anti_join(stop_words, by = c("dictionary" = "word"))
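
Because bind_cols pairs rows purely by position, it can be worth checking that both token tables have the same number of rows before binding. The sketch below does this on made-up data; the toy tibbles and the explicit check are illustrative and not part of the original pipeline.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical stemmed and unstemmed versions of the same two moments
stemmed_toy  <- tibble(text = c("i walk my dog", "we cook dinner"))
original_toy <- tibble(text = c("i walked my dogs", "we cooked dinner"))

stem_tokens <- stemmed_toy %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text)

dict_tokens <- original_toy %>%
  unnest_tokens(dictionary, text)

# Positional binding is only meaningful if the token counts match
stopifnot(nrow(stem_tokens) == nrow(dict_tokens))

paired <- bind_cols(stem_tokens, dict_tokens)
print(paired)
```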

Step 7 - Stem completion

Next, we complete each stem by replacing it with the original word that maps to it most frequently.

completed <- completed %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%
  ungroup() %>%
  select(stems, word) %>%
  distinct() %>%
  right_join(completed, by = "stems") %>%
  select(-stems)
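
The stem completion logic can be traced on a small made-up table of stem and word pairs. In this sketch the names pairs, lookup and completed_toy are hypothetical, and a left join from the token table is used in place of the right join above, which yields the same pairing.

```r
library(dplyr)
library(tibble)

# Hypothetical stem/word pairs across three moments
pairs <- tibble(
  id         = c(1, 1, 2, 2, 3, 3),
  stems      = c("walk", "dog", "walk", "dog", "walk", "dog"),
  dictionary = c("walked", "dog", "walked", "dog", "walking", "dogs")
)

# For each stem, pick the original word that occurs most often
lookup <- pairs %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%
  ungroup() %>%
  select(stems, word) %>%
  distinct()

# Attach the chosen word to every token and drop the stem
completed_toy <- pairs %>%
  left_join(lookup, by = "stems") %>%
  select(-stems)

print(lookup)
print(completed_toy)
```

Here "walking" is replaced by "walked" and "dogs" by "dog", because those are the more frequent forms for their stems. When two forms are tied, which.max returns the first one in the counted order.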

Step 8 - Pasting stem completed individual words into their respective happy moments

We want our processed words to resemble the structure of the original happy moments. So we paste the words together to form happy moments.

completed <- completed %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " ")) %>%
  ungroup()
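
As a quick illustration of the pasting step, the sketch below collapses a made-up table of words back into one string per moment. The data are hypothetical.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical completed words, tagged with the moment they came from
words <- tibble(
  id   = c(1, 1, 2, 2, 2),
  word = c("walked", "dog", "cooked", "family", "dinner")
)

# One row per moment, words joined by single spaces in their original order
moments <- words %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " "))

print(moments)
```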

Step 9 - Keeping track of the happy moments with their own ID

hm_data <- hm_data %>%
  mutate(id = row_number()) %>%
  inner_join(completed, by = "id")
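
One consequence of using an inner join is that a moment whose words were all removed as stopwords has no processed text and therefore drops out of the result. The sketch below shows this on made-up data; raw and processed are hypothetical stand-ins for the two tables being joined.

```r
library(dplyr)
library(tibble)

# Hypothetical raw moments; the second one consists only of custom stopwords
raw <- tibble(cleaned_hm = c("I walked my dog", "Happy today", "We cooked dinner")) %>%
  mutate(id = row_number())

# Hypothetical processed text; id 2 is absent because no words survived
processed <- tibble(id = c(1L, 3L), text = c("walked dog", "cooked dinner"))

joined <- inner_join(raw, processed, by = "id")
print(joined)
```

If every original moment must be retained, a left join would keep the unmatched rows with a missing text value instead.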

datatable(hm_data)

Exporting the processed text data into a CSV file

write_csv(hm_data, "../output/processed_moments.csv")

The final processed data is ready to be used for any kind of analysis.